Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline
Authors:
Prof. Wuhui Chen, Sun Yat-sen University
Link: [2502.06888] Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline
Abstract:
Mixture of Experts (MoE), with its distinctive sparse structure, enables the scaling of language models up to trillions of parameters without significantly increasing computational costs. However, the substantial parameter size presents a challenge for inference, as the expansion in GPU memory cannot keep pace with the growth in parameters【GPU memory cannot keep pace with the growth of MoE parameters】. Although offloading techniques utilise memory from the CPU and disk and parallelise the I/O and computation for efficiency, the computation for each expert in MoE models is often less than the I/O, resulting in numerous bubbles in the pipeline【adopting offloading inevitably brings a large communication overhead】.
Therefore, we propose Klotski, an efficient MoE inference engine that significantly reduces pipeline bubbles through a novel expert-aware multi-batch pipeline paradigm. The proposed paradigm uses batch processing to extend the computation time of the current layer to overlap with the loading time of the next layer. Although this idea has been effectively applied to dense models, more batches may activate more experts in the MoE, leading to longer loading times and more bubbles. Thus, unlike traditional approaches, we balance computation and I/O time and minimise bubbles by orchestrating their inference orders based on their heterogeneous computation and I/O requirements and activation patterns under different batch numbers. Moreover, to adapt to different hardware environments and models, we design a constraint-sensitive I/O-compute planner and a correlation-aware expert prefetcher for a schedule that minimises pipeline bubbles. Experimental results demonstrate that Klotski achieves a superior throughput-latency trade-off compared to state-of-the-art techniques, with throughput improvements of up to 85.12x.
Story
MoE models are growing fast, and a GPU cannot hold a full MoE model.
-> Use offloading.
-> However, with offloading the communication time is much longer than the computation time, creating pipeline bubbles (Fig. 1a):
1. Inter-layer bubbles: bubbles between attention computation and expert loading; attention has finished computing, but the experts have not finished transferring.
2. Intra-layer bubbles: bubbles from the imbalance between expert computation and I/O; one expert has finished computing but must wait for the other experts to be loaded and computed.
-> Bubbles arise because the computation time is too short, so running multiple batches together lengthens the computation and thereby reduces bubbles.
-> Putting multiple batches together may activate more experts, so some experts may still have to wait to be loaded before they can compute (Fig. 1b; a rough timing sketch follows this list).
-> Experts also have hot/cold skew, so compute the hot experts first and the cold experts afterwards.
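To make this trade-off concrete, here is a rough back-of-envelope model (my own illustration, not from the paper): per-layer compute grows linearly with the batch count Q, while expert-loading time grows with the number of distinct experts those Q batches activate. All numbers and the `activated` function are hypothetical.

```python
# Rough bubble model (illustration only): compute stretches with the batch count Q,
# while expert I/O stretches with the number of *distinct* experts those Q batches activate.

def bubble_time(q, t_compute_per_batch, t_io_per_expert, activated_experts):
    """GPU idle time per layer = expert-loading time not hidden by computation."""
    compute = q * t_compute_per_batch            # more batches -> longer compute window
    io = activated_experts(q) * t_io_per_expert  # more batches -> usually more experts to load
    return max(0.0, io - compute)

# Hypothetical numbers: 2 ms of expert compute per batch, 5 ms to load one expert,
# and each batch activates roughly two distinct experts, capped at 8 experts per layer.
activated = lambda q: min(8, 2 * q)
for q in (1, 2, 4, 8, 16, 32):
    print(q, bubble_time(q, 2.0, 5.0, activated))
# The bubble first *grows* with Q (more batches activate more experts) and only
# shrinks once the activated-expert count saturates -- which is why the paper pairs
# multi-batching with expert-aware ordering instead of relying on batching alone.
```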
Overlap strategies for hiding loading time work better on dense models than on MoE.
Strawman strategies
- a: The traditional offloading strategy.
- b: A FlexGen-inspired multi-batch strategy: load the model once, then compute multiple batches on it.
  - ❌ Loads the entire model at once, without partitioning by expert.
- c: Predict which experts will be needed, then prefetch a subset of them for computation.
  - ❔ How to predict?
  - ❌ When E1 is not prefetched successfully, executing in the original computation order still stalls -> adjust the computation order.
Determining sparsity
[23] Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and Hong Xu. 2023. Accelerating distributed MoE training and inference with lina. In 2023 USENIX Annual Technical Conference (USENIX ATC 23). 945–959.
[24] Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, and Li Yuan. 2024. Moe-llava: Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947 (2024).
Experts exhibit activation sparsity.
Overview
Expert-aware Multi-batch Pipeline Paradigm
Minimizing inter-layer bubbles (attention has finished computing, but the experts have not finished transferring)
- Fetch only a subset of the experts rather than all of them.
- After the gate computation, if a selected expert is not resident, prefetch it.
Minimizing intra-layer bubbles (one expert has finished computing, but the next expert has not finished transferring)
- Compute experts in hot-to-cold order, and offload each expert once it finishes computing to reduce GPU memory usage.
- Multi-batch processing (see the schedule sketch after this list).
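A minimal sketch of how this hot-first, offload-after-use schedule could look, assuming a resident-expert set in GPU memory and using a `ThreadPoolExecutor` as a stand-in for an asynchronous copy stream; the expert names, timings, and the `load_expert`/`compute_expert` helpers are hypothetical, not the paper's code.

```python
from concurrent.futures import ThreadPoolExecutor
import time

RESIDENT = {"E0", "E2"}             # experts already in GPU memory (hypothetical)

def load_expert(e):
    """Stand-in for a CPU->GPU weight transfer."""
    time.sleep(0.05)
    RESIDENT.add(e)
    return e

def compute_expert(e, batch):
    """Stand-in for running one expert FFN on one batch."""
    time.sleep(0.02)
    return f"{e}({batch})"

def run_moe_layer(activated, batches, io_pool):
    hot  = [e for e in activated if e in RESIDENT]       # resident -> compute first
    cold = [e for e in activated if e not in RESIDENT]   # missing  -> load in background
    futures = {e: io_pool.submit(load_expert, e) for e in cold}
    outputs = []
    for e in hot + cold:              # hot-to-cold order hides the cold experts' I/O
        if e in futures:
            futures[e].result()       # block only if the transfer has not finished yet
        outputs += [compute_expert(e, b) for b in batches]
        RESIDENT.discard(e)           # offload right after compute to cap GPU memory
    return outputs

with ThreadPoolExecutor(max_workers=2) as io_pool:
    print(run_moe_layer(["E0", "E2", "E5", "E7"], ["batch0", "batch1"], io_pool))
```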
Correlation-aware Expert Prefetcher
Records which experts should be prefetched first (one possible realization is sketched below).
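The note does not spell out the statistic, so the following is only one plausible reading: track co-activation counts between experts of adjacent layers and prefetch the next layer's most correlated experts. The class name and API are mine, not the paper's.

```python
from collections import defaultdict

class CorrelationPrefetcher:
    """Hypothetical correlation tracker: which next-layer experts tend to follow
    the experts activated in the current layer."""

    def __init__(self, top_k=2):
        self.top_k = top_k
        # counts[(layer, expert)][next_layer_expert] -> co-activation count
        self.counts = defaultdict(lambda: defaultdict(int))

    def record(self, layer, experts_now, experts_next):
        """Update statistics after observing one routing trace."""
        for e in experts_now:
            for f in experts_next:
                self.counts[(layer, e)][f] += 1

    def prefetch_candidates(self, layer, experts_now):
        """Next-layer experts to start loading while this layer is still computing."""
        scores = defaultdict(int)
        for e in experts_now:
            for f, c in self.counts[(layer, e)].items():
                scores[f] += c
        return sorted(scores, key=scores.get, reverse=True)[: self.top_k]

pf = CorrelationPrefetcher()
pf.record(layer=0, experts_now=[1, 3], experts_next=[2, 5])
pf.record(layer=0, experts_now=[1], experts_next=[2])
print(pf.prefetch_candidates(layer=0, experts_now=[1, 3]))  # -> [2, 5]
```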
Constraint-Sensitive I/O-Compute Planner
Determining the number of batches
- Ⅰ: Gate computation starts
- Ⅱ: Hot-expert computation starts
- Ⅲ: Cold-expert computation starts
- Ⅳ: Next layer's attention starts
- $t_c$: computation time, $t_{I/O}$: I/O time
- A (attention), G (gate), hot-E (hot experts), $E_i$ (the $i$-th expert)
Constraints (formalized in the sketch after this list):
- The gate transfer must complete before the attention computation finishes.
- The hot-expert transfers must complete before the gate computation finishes.
- The first cold expert's transfer must complete before the hot experts finish computing.
- The next layer's attention transfer must complete before the remaining activated experts finish computing (Q is set per layer based on profiling and differs across layers).
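A minimal sketch of how these four constraints could be checked to pick the smallest batch number Q, under my own assumptions: compute time scales linearly with Q, each transfer overlaps all compute phases that precede its deadline, and all times come from profiling. The dictionary keys and numbers are hypothetical, and the actual planner presumably also models memory limits.

```python
def constraints_hold(q, tc, tio):
    """tc: per-batch compute times (ms); tio: one-off transfer times (ms)."""
    return (
        tio["G"]          <= q * tc["A"]                                         # (1) gate weights before attention ends
        and tio["hotE"]   <= q * (tc["A"] + tc["G"])                             # (2) hot experts before the gate ends
        and tio["coldE1"] <= q * (tc["A"] + tc["G"] + tc["hotE"])                # (3) first cold expert before hot experts end
        and tio["A_next"] <= q * (tc["A"] + tc["G"] + tc["hotE"] + tc["coldE"])  # (4) next attention before remaining experts end
    )

def pick_batch_number(tc, tio, q_max=64):
    """Smallest Q that hides all transfers; q_max stands in for the memory limit."""
    for q in range(1, q_max + 1):
        if constraints_hold(q, tc, tio):
            return q
    return q_max  # fall back: the transfers cannot all be hidden on this hardware

# Hypothetical profiled numbers (ms).
tc  = {"A": 1.0, "G": 0.1, "hotE": 2.0, "coldE": 2.0}
tio = {"G": 0.5, "hotE": 6.0, "coldE1": 8.0, "A_next": 12.0}
print(pick_batch_number(tc, tio))  # -> 6 with these numbers
```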
Reflections
Overlap-oriented work like this can achieve strong experimental results and suits academic groups with limited hardware, but would production deployments care much about this setting?